My Details:

About the competition

Real estate investment

is one best type of investment becouse the statble,The benefits of investing in real estate are numerous With well-chosen assets, investors can enjoy predictable cash flow, excellent returns, tax advantages, and diversification—and it's possible to leverage real estate to build wealth. Real estate investors make money through rental income, any profits generated by property-dependent business activity, and appreciation. Real estate values tend to increase over time, and with a good investment, you can turn a profit when it's time to sell. Rents also tend to rise over time, which can lead to higher cash flow. This chart from the Federal Reserve Bank of St. Louis shows average home prices in the U.S. since 1963. The areas shaded in grey indicate U.S. recessions

Our goal:

our main goal is developing an algorithm that best predicts House Prices, allowing the real estate company decide on the best prices to practice in the market, bringing agility and a robust system behind it.

Visualization:

In [65]:
# update plotly and pandas_profiling version
!pip install --upgrade plotly
!pip install sweetviz
Requirement already up-to-date: plotly in /usr/local/lib/python3.6/dist-packages (4.14.1)
Requirement already satisfied, skipping upgrade: retrying>=1.3.3 in /usr/local/lib/python3.6/dist-packages (from plotly) (1.3.3)
Requirement already satisfied, skipping upgrade: six in /usr/local/lib/python3.6/dist-packages (from plotly) (1.15.0)
Requirement already satisfied: sweetviz in /usr/local/lib/python3.6/dist-packages (2.0.4)
Requirement already satisfied: jinja2>=2.11.1 in /usr/local/lib/python3.6/dist-packages (from sweetviz) (2.11.2)
Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.6/dist-packages (from sweetviz) (1.4.1)
Requirement already satisfied: matplotlib>=3.1.3 in /usr/local/lib/python3.6/dist-packages (from sweetviz) (3.2.2)
Requirement already satisfied: pandas!=1.0.0,!=1.0.1,!=1.0.2,>=0.25.3 in /usr/local/lib/python3.6/dist-packages (from sweetviz) (1.1.5)
Requirement already satisfied: numpy>=1.16.0 in /usr/local/lib/python3.6/dist-packages (from sweetviz) (1.19.4)
Requirement already satisfied: tqdm>=4.43.0 in /usr/local/lib/python3.6/dist-packages (from sweetviz) (4.54.1)
Requirement already satisfied: importlib-resources>=1.2.0 in /usr/local/lib/python3.6/dist-packages (from sweetviz) (3.3.0)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.6/dist-packages (from jinja2>=2.11.1->sweetviz) (1.1.1)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.1.3->sweetviz) (2.4.7)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.1.3->sweetviz) (1.3.1)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.1.3->sweetviz) (0.10.0)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.1.3->sweetviz) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas!=1.0.0,!=1.0.1,!=1.0.2,>=0.25.3->sweetviz) (2018.9)
Requirement already satisfied: zipp>=0.4; python_version < "3.8" in /usr/local/lib/python3.6/dist-packages (from importlib-resources>=1.2.0->sweetviz) (3.4.0)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from cycler>=0.10->matplotlib>=3.1.3->sweetviz) (1.15.0)
In [66]:
# import numpy, matplotlib, etc.
import math
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go

# sklearn imports
from sklearn import metrics
from sklearn import pipeline
from sklearn import linear_model
from sklearn import preprocessing
from sklearn import neural_network
from sklearn import model_selection
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

Data Understanding

In [67]:
train_df = pd.read_csv("/content/train.csv")
test_df = pd.read_csv("/content/test.csv")
test_id = test_df["Id"]
In [68]:
display(train_df)
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating ... CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2003 2003 Gable CompShg VinylSd VinylSd BrkFace 196.0 Gd TA PConc Gd TA No GLQ 706 Unf 0 150 856 GasA ... Y SBrkr 856 854 0 1710 1 0 2 1 3 1 Gd 8 Typ 0 NaN Attchd 2003.0 RFn 2 548 TA TA Y 0 61 0 0 0 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm 1Fam 1Story 6 8 1976 1976 Gable CompShg MetalSd MetalSd None 0.0 TA TA CBlock Gd TA Gd ALQ 978 Unf 0 284 1262 GasA ... Y SBrkr 1262 0 0 1262 0 1 2 0 3 1 TA 6 Typ 1 TA Attchd 1976.0 RFn 2 460 TA TA Y 298 0 0 0 0 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2001 2002 Gable CompShg VinylSd VinylSd BrkFace 162.0 Gd TA PConc Gd TA Mn GLQ 486 Unf 0 434 920 GasA ... Y SBrkr 920 866 0 1786 1 0 2 1 3 1 Gd 6 Typ 1 TA Attchd 2001.0 RFn 2 608 TA TA Y 0 42 0 0 0 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub Corner Gtl Crawfor Norm Norm 1Fam 2Story 7 5 1915 1970 Gable CompShg Wd Sdng Wd Shng None 0.0 TA TA BrkTil TA Gd No ALQ 216 Unf 0 540 756 GasA ... Y SBrkr 961 756 0 1717 1 0 1 0 3 1 Gd 7 Typ 1 Gd Detchd 1998.0 Unf 3 642 TA TA Y 0 35 272 0 0 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub FR2 Gtl NoRidge Norm Norm 1Fam 2Story 8 5 2000 2000 Gable CompShg VinylSd VinylSd BrkFace 350.0 Gd TA PConc Gd TA Av GLQ 655 Unf 0 490 1145 GasA ... Y SBrkr 1145 1053 0 2198 1 0 2 1 4 1 Gd 9 Typ 1 TA Attchd 2000.0 RFn 3 836 TA TA Y 192 84 0 0 0 0 NaN NaN NaN 0 12 2008 WD Normal 250000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1455 1456 60 RL 62.0 7917 Pave NaN Reg Lvl AllPub Inside Gtl Gilbert Norm Norm 1Fam 2Story 6 5 1999 2000 Gable CompShg VinylSd VinylSd None 0.0 TA TA PConc Gd TA No Unf 0 Unf 0 953 953 GasA ... Y SBrkr 953 694 0 1647 0 0 2 1 3 1 TA 7 Typ 1 TA Attchd 1999.0 RFn 2 460 TA TA Y 0 40 0 0 0 0 NaN NaN NaN 0 8 2007 WD Normal 175000
1456 1457 20 RL 85.0 13175 Pave NaN Reg Lvl AllPub Inside Gtl NWAmes Norm Norm 1Fam 1Story 6 6 1978 1988 Gable CompShg Plywood Plywood Stone 119.0 TA TA CBlock Gd TA No ALQ 790 Rec 163 589 1542 GasA ... Y SBrkr 2073 0 0 2073 1 0 2 0 3 1 TA 7 Min1 2 TA Attchd 1978.0 Unf 2 500 TA TA Y 349 0 0 0 0 0 NaN MnPrv NaN 0 2 2010 WD Normal 210000
1457 1458 70 RL 66.0 9042 Pave NaN Reg Lvl AllPub Inside Gtl Crawfor Norm Norm 1Fam 2Story 7 9 1941 2006 Gable CompShg CemntBd CmentBd None 0.0 Ex Gd Stone TA Gd No GLQ 275 Unf 0 877 1152 GasA ... Y SBrkr 1188 1152 0 2340 0 0 2 0 4 1 Gd 9 Typ 2 Gd Attchd 1941.0 RFn 1 252 TA TA Y 0 60 0 0 0 0 NaN GdPrv Shed 2500 5 2010 WD Normal 266500
1458 1459 20 RL 68.0 9717 Pave NaN Reg Lvl AllPub Inside Gtl NAmes Norm Norm 1Fam 1Story 5 6 1950 1996 Hip CompShg MetalSd MetalSd None 0.0 TA TA CBlock TA TA Mn GLQ 49 Rec 1029 0 1078 GasA ... Y FuseA 1078 0 0 1078 1 0 1 0 2 1 Gd 5 Typ 0 NaN Attchd 1950.0 Unf 1 240 TA TA Y 366 0 112 0 0 0 NaN NaN NaN 0 4 2010 WD Normal 142125
1459 1460 20 RL 75.0 9937 Pave NaN Reg Lvl AllPub Inside Gtl Edwards Norm Norm 1Fam 1Story 5 6 1965 1965 Gable CompShg HdBoard HdBoard None 0.0 Gd TA CBlock TA TA No BLQ 830 LwQ 290 136 1256 GasA ... Y SBrkr 1256 0 0 1256 1 0 1 1 3 1 TA 6 Typ 0 NaN Attchd 1965.0 Fin 1 276 TA TA Y 736 68 0 0 0 0 NaN NaN NaN 0 6 2008 WD Normal 147500

1460 rows × 81 columns

now we will se all our features each column is a feature , another description is type

In [69]:
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallCond    1460 non-null   int64  
 19  YearBuilt      1460 non-null   int64  
 20  YearRemodAdd   1460 non-null   int64  
 21  RoofStyle      1460 non-null   object 
 22  RoofMatl       1460 non-null   object 
 23  Exterior1st    1460 non-null   object 
 24  Exterior2nd    1460 non-null   object 
 25  MasVnrType     1452 non-null   object 
 26  MasVnrArea     1452 non-null   float64
 27  ExterQual      1460 non-null   object 
 28  ExterCond      1460 non-null   object 
 29  Foundation     1460 non-null   object 
 30  BsmtQual       1423 non-null   object 
 31  BsmtCond       1423 non-null   object 
 32  BsmtExposure   1422 non-null   object 
 33  BsmtFinType1   1423 non-null   object 
 34  BsmtFinSF1     1460 non-null   int64  
 35  BsmtFinType2   1422 non-null   object 
 36  BsmtFinSF2     1460 non-null   int64  
 37  BsmtUnfSF      1460 non-null   int64  
 38  TotalBsmtSF    1460 non-null   int64  
 39  Heating        1460 non-null   object 
 40  HeatingQC      1460 non-null   object 
 41  CentralAir     1460 non-null   object 
 42  Electrical     1459 non-null   object 
 43  1stFlrSF       1460 non-null   int64  
 44  2ndFlrSF       1460 non-null   int64  
 45  LowQualFinSF   1460 non-null   int64  
 46  GrLivArea      1460 non-null   int64  
 47  BsmtFullBath   1460 non-null   int64  
 48  BsmtHalfBath   1460 non-null   int64  
 49  FullBath       1460 non-null   int64  
 50  HalfBath       1460 non-null   int64  
 51  BedroomAbvGr   1460 non-null   int64  
 52  KitchenAbvGr   1460 non-null   int64  
 53  KitchenQual    1460 non-null   object 
 54  TotRmsAbvGrd   1460 non-null   int64  
 55  Functional     1460 non-null   object 
 56  Fireplaces     1460 non-null   int64  
 57  FireplaceQu    770 non-null    object 
 58  GarageType     1379 non-null   object 
 59  GarageYrBlt    1379 non-null   float64
 60  GarageFinish   1379 non-null   object 
 61  GarageCars     1460 non-null   int64  
 62  GarageArea     1460 non-null   int64  
 63  GarageQual     1379 non-null   object 
 64  GarageCond     1379 non-null   object 
 65  PavedDrive     1460 non-null   object 
 66  WoodDeckSF     1460 non-null   int64  
 67  OpenPorchSF    1460 non-null   int64  
 68  EnclosedPorch  1460 non-null   int64  
 69  3SsnPorch      1460 non-null   int64  
 70  ScreenPorch    1460 non-null   int64  
 71  PoolArea       1460 non-null   int64  
 72  PoolQC         7 non-null      object 
 73  Fence          281 non-null    object 
 74  MiscFeature    54 non-null     object 
 75  MiscVal        1460 non-null   int64  
 76  MoSold         1460 non-null   int64  
 77  YrSold         1460 non-null   int64  
 78  SaleType       1460 non-null   object 
 79  SaleCondition  1460 non-null   object 
 80  SalePrice      1460 non-null   int64  
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
In [70]:
import sweetviz as sw

house_prices_report = sw.analyze(train_df)
house_prices_report.show_notebook(layout='vertical')

Data exploration

First overview of the correlation between features

In [71]:
#correlation matrix
corrmat = train_df.corr()
f, ax = plt.subplots(figsize=(20, 10))
sns.heatmap(corrmat, vmax=.8, square=True);

as you see its hard to understand the map , so we will reduce the heatmap to the top 10 correlation features with our target (SalePrice).

In [72]:
#saleprice correlation matrix
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
f, ax = plt.subplots(figsize=(12, 9))
cm = np.corrcoef(train_df[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

now we will check for outliers and then remove them.

In [73]:
var = 'OverallQual'
data = pd.concat([train_df['SalePrice'], train_df[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));
*c* argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*.  Please use the *color* keyword-argument or provide a 2-D array with a single row if you intend to specify the same RGB or RGBA value for all points.

We can see that there are no significant outliers between these features.

In [74]:
var = 'GrLivArea'
data = pd.concat([train_df['SalePrice'], train_df[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));
*c* argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*.  Please use the *color* keyword-argument or provide a 2-D array with a single row if you intend to specify the same RGB or RGBA value for all points.

We can see that there are two out liers with low price and big GrLivArea - lets drop them.

In [75]:
train_df.sort_values(by = 'GrLivArea', ascending = False)[:2]
Out[75]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating ... CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
1298 1299 60 RL 313.0 63887 Pave NaN IR3 Bnk AllPub Corner Gtl Edwards Feedr Norm 1Fam 2Story 10 5 2008 2008 Hip ClyTile Stucco Stucco Stone 796.0 Ex TA PConc Ex TA Gd GLQ 5644 Unf 0 466 6110 GasA ... Y SBrkr 4692 950 0 5642 2 0 2 1 3 1 Ex 12 Typ 3 Gd Attchd 2008.0 Fin 2 1418 TA TA Y 214 292 0 0 0 480 Gd NaN NaN 0 1 2008 New Partial 160000
523 524 60 RL 130.0 40094 Pave NaN IR1 Bnk AllPub Inside Gtl Edwards PosN PosN 1Fam 2Story 10 5 2007 2008 Hip CompShg CemntBd CmentBd Stone 762.0 Ex TA PConc Ex TA Gd GLQ 2260 Unf 0 878 3138 GasA ... Y SBrkr 3138 1538 0 4676 1 0 3 1 3 1 Ex 11 Typ 1 Gd BuiltIn 2007.0 Fin 3 884 TA TA Y 208 406 0 0 0 0 NaN NaN NaN 0 10 2007 New Partial 184750

2 rows × 81 columns

In [76]:
train_df = train_df.drop(train_df[train_df['Id'] == 1299].index)
train_df = train_df.drop(train_df[train_df['Id'] == 524].index)
train_df.reset_index(drop=True,inplace=True)

TotalBsmtSF: Total square feet of basement area

In [77]:
var = 'TotalBsmtSF'
data = pd.concat([train_df['SalePrice'], train_df[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));
*c* argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*.  Please use the *color* keyword-argument or provide a 2-D array with a single row if you intend to specify the same RGB or RGBA value for all points.

We can see that there are no significant outliers between these features.

GarageCars: Size of garage in car capacity

In [78]:
var = 'GarageCars'
data = pd.concat([train_df['SalePrice'], train_df[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));
*c* argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with *x* & *y*.  Please use the *color* keyword-argument or provide a 2-D array with a single row if you intend to specify the same RGB or RGBA value for all points.

We can see that there are no significant outliers between these features.

Removing Features

In the sweetWiz we can see 'Utilities' in all houses is 'Allpub' except for 1. 'Condition2' feature have 99% 'Norm' and 'RoofMatl' have 98% "CompShq', so we`ll drop that features.

Heatmap show us that 'GarageArea' and 'GarageCars' are higly correlated. Same for '1stFlrSF' and 'TotalbsmtSF' and 'GarageYrBlt' and 'YearBlt'. we decided to drop one feature of each pair (keep the one with the higher correlation with the target). Futrhermore 'GarageFinish' and 'GarageCond' are droped because they giving the similar info to 'GarageQual'

In [79]:
train_df = train_df.drop(columns=['GarageYrBlt','GarageArea','1stFlrSF','GarageFinish','GarageCond','Utilities','Condition2','RoofMatl'])
test_df = test_df.drop(columns=['GarageYrBlt','GarageArea','1stFlrSF','GarageFinish','GarageCond','Utilities','Condition2','RoofMatl'])

Removing Id column and columns with too many missing data values.

In [80]:
total = train_df.isnull().sum().sort_values(ascending=False)
percent = (train_df.isnull().sum()/train_df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
Out[80]:
Total Percent
PoolQC 1452 0.995885
MiscFeature 1404 0.962963
Alley 1367 0.937586
Fence 1177 0.807270
FireplaceQu 690 0.473251
LotFrontage 259 0.177641
GarageType 81 0.055556
GarageQual 81 0.055556
BsmtExposure 38 0.026063
BsmtFinType2 38 0.026063
BsmtQual 37 0.025377
BsmtFinType1 37 0.025377
BsmtCond 37 0.025377
MasVnrArea 8 0.005487
MasVnrType 8 0.005487
Electrical 1 0.000686
Exterior2nd 0 0.000000
Exterior1st 0 0.000000
SalePrice 0 0.000000
ExterQual 0 0.000000

now we will drop all columns with more than 20% missing values

In [81]:
train_df = train_df.drop(columns=['Id','Alley','PoolQC','Fence','MiscFeature','FireplaceQu'])
test_df = test_df.drop(columns=['Id','Alley','PoolQC','Fence','MiscFeature','FireplaceQu'])
In [82]:
def get_cmap(n, name='plasma'):
    return plt.cm.get_cmap(name, n)

# plot target values by each feature
def plot_target_values_by_each_feature(df, target_column_name):
    nrows = math.ceil(math.sqrt(len(df.columns)-1))
    ncols = math.ceil((len(df.columns)-1)/nrows)
    plt.style.use('seaborn')
    fig, axes = plt.subplots(nrows, ncols)
    plt.subplots_adjust(top=3, bottom=0, left=0, right=2.5)
    colors = get_cmap(len(df.columns))

    counter = 0
    for i in range(len(df.columns)-1):
        df.plot(kind='scatter', x=df.columns[i], y=target_column_name, ax=axes[i//nrows, i%nrows], color=colors(i))
        axes[i//nrows, i%nrows].tick_params(axis='both', labelsize=10)
        axes[i//nrows, i%nrows].xaxis.label.set_size(10)
        axes[i//nrows, i%nrows].yaxis.label.set_size(10)
        axes[i//nrows, i%nrows].title.set_fontsize(10)

    for i in range(len(df.columns)-1, nrows*ncols):
        fig.delaxes(axes.flatten()[i])
In [83]:
numerical_cols = train_df.select_dtypes(include=['int64', 'float64']).columns
df_numerical = train_df[numerical_cols]
plot_target_values_by_each_feature(df_numerical, 'SalePrice')

Conclusions

'YearBuilt' and 'YearRemodAdd'

We will combine them because they seem to affect the target in similar way.

In [84]:
for df in [train_df, test_df]:
  df["YearLstCnst"] = df[["YearBuilt", "YearRemodAdd"]].max(axis=1)
train_df = train_df.drop(columns = ["YearBuilt", "YearRemodAdd"])
test_df = test_df.drop(columns = ["YearBuilt", "YearRemodAdd"])
display(train_df["YearLstCnst"])
0       2003
1       1976
2       2002
3       1970
4       2000
        ... 
1453    2000
1454    1988
1455    2006
1456    1996
1457    1965
Name: YearLstCnst, Length: 1458, dtype: int64

'EnclosedPorch', '3SsnPorch','ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold'

None of them have good correlation with the price, i will drop them.

In [85]:
train_df = train_df.drop(columns=['EnclosedPorch', '3SsnPorch','ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold'])
test_df = test_df.drop(columns=['EnclosedPorch', '3SsnPorch','ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold'])

KitchenAbvGr

It seem that the number of kitchens above ground dosent affect the price, in addition most of the houses have only one. I will drop that column.

In [86]:
train_df = train_df.drop(columns=['KitchenAbvGr'])
test_df = test_df.drop(columns=['KitchenAbvGr'])
In [87]:
total = train_df.isnull().sum().sort_values(ascending=False)
percent = (train_df.isnull().sum()/train_df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
Out[87]:
Total Percent
LotFrontage 259 0.177641
GarageQual 81 0.055556
GarageType 81 0.055556
BsmtFinType2 38 0.026063
BsmtExposure 38 0.026063
BsmtCond 37 0.025377
BsmtQual 37 0.025377
BsmtFinType1 37 0.025377
MasVnrArea 8 0.005487
MasVnrType 8 0.005487
Electrical 1 0.000686
Street 0 0.000000
RoofStyle 0 0.000000
Foundation 0 0.000000
ExterCond 0 0.000000
ExterQual 0 0.000000
Exterior2nd 0 0.000000
Exterior1st 0 0.000000
OverallCond 0 0.000000
LotShape 0 0.000000

Filling missing values

LotFrontage : Since the area of each street connected to the house property most likely have a similar area to other houses in its neighborhood , we can fill in missing values by the median LotFrontage of the neighborhood.

In [88]:
train_df["LotFrontage"] = train_df.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))

test_df["LotFrontage"] = test_df.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))

BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1 and BsmtFinType2 :
For all these categorical basement-related features, NaN means that there is no basement

In [89]:
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    train_df[col] = train_df[col].fillna('None')

for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    test_df[col] = test_df[col].fillna('None')

MasVnrArea and MasVnrType :
NA most likely means no masonry veneer for these houses.
We can fill 0 for the area and None for the type.

In [90]:
train_df["MasVnrType"] = train_df["MasVnrType"].fillna("None")
train_df["MasVnrArea"] = train_df["MasVnrArea"].fillna(0)
test_df["MasVnrType"] = test_df["MasVnrType"].fillna("None")
test_df["MasVnrArea"] = test_df["MasVnrArea"].fillna(0)

GarageType and GarageQual : Replacing missing data with None

In [91]:
for col in ('GarageType', 'GarageQual'):
    train_df[col] = train_df[col].fillna('None')
    
for col in ('GarageType', 'GarageQual'):
    test_df[col] = test_df[col].fillna('None')

'Electrical' has only 1 missing value, so i fill it with the most frequent value.

In [92]:
train_df['Electrical'] = train_df['Electrical'].fillna(train_df['Electrical'].mode()[0])
test_df['Electrical'] = test_df['Electrical'].fillna(test_df['Electrical'].mode()[0])

Transforming some numerical variables that are really categorical

In [93]:
train_df['MSSubClass'] = train_df['MSSubClass'].apply(str)
test_df['MSSubClass'] = test_df['MSSubClass'].apply(str)

Handeling with the test data

Filling test missing values

In [94]:
test_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 57 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1459 non-null   object 
 1   MSZoning       1455 non-null   object 
 2   LotFrontage    1459 non-null   float64
 3   LotArea        1459 non-null   int64  
 4   Street         1459 non-null   object 
 5   LotShape       1459 non-null   object 
 6   LandContour    1459 non-null   object 
 7   LotConfig      1459 non-null   object 
 8   LandSlope      1459 non-null   object 
 9   Neighborhood   1459 non-null   object 
 10  Condition1     1459 non-null   object 
 11  BldgType       1459 non-null   object 
 12  HouseStyle     1459 non-null   object 
 13  OverallQual    1459 non-null   int64  
 14  OverallCond    1459 non-null   int64  
 15  RoofStyle      1459 non-null   object 
 16  Exterior1st    1458 non-null   object 
 17  Exterior2nd    1458 non-null   object 
 18  MasVnrType     1459 non-null   object 
 19  MasVnrArea     1459 non-null   float64
 20  ExterQual      1459 non-null   object 
 21  ExterCond      1459 non-null   object 
 22  Foundation     1459 non-null   object 
 23  BsmtQual       1459 non-null   object 
 24  BsmtCond       1459 non-null   object 
 25  BsmtExposure   1459 non-null   object 
 26  BsmtFinType1   1459 non-null   object 
 27  BsmtFinSF1     1458 non-null   float64
 28  BsmtFinType2   1459 non-null   object 
 29  BsmtFinSF2     1458 non-null   float64
 30  BsmtUnfSF      1458 non-null   float64
 31  TotalBsmtSF    1458 non-null   float64
 32  Heating        1459 non-null   object 
 33  HeatingQC      1459 non-null   object 
 34  CentralAir     1459 non-null   object 
 35  Electrical     1459 non-null   object 
 36  2ndFlrSF       1459 non-null   int64  
 37  LowQualFinSF   1459 non-null   int64  
 38  GrLivArea      1459 non-null   int64  
 39  BsmtFullBath   1457 non-null   float64
 40  BsmtHalfBath   1457 non-null   float64
 41  FullBath       1459 non-null   int64  
 42  HalfBath       1459 non-null   int64  
 43  BedroomAbvGr   1459 non-null   int64  
 44  KitchenQual    1458 non-null   object 
 45  TotRmsAbvGrd   1459 non-null   int64  
 46  Functional     1457 non-null   object 
 47  Fireplaces     1459 non-null   int64  
 48  GarageType     1459 non-null   object 
 49  GarageCars     1458 non-null   float64
 50  GarageQual     1459 non-null   object 
 51  PavedDrive     1459 non-null   object 
 52  WoodDeckSF     1459 non-null   int64  
 53  OpenPorchSF    1459 non-null   int64  
 54  SaleType       1458 non-null   object 
 55  SaleCondition  1459 non-null   object 
 56  YearLstCnst    1459 non-null   int64  
dtypes: float64(9), int64(14), object(34)
memory usage: 649.8+ KB
In [95]:
total = test_df.isnull().sum().sort_values(ascending=False)
percent = (test_df.isnull().sum()/test_df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
Out[95]:
Total Percent
MSZoning 4 0.002742
BsmtHalfBath 2 0.001371
Functional 2 0.001371
BsmtFullBath 2 0.001371
Exterior1st 1 0.000685
Exterior2nd 1 0.000685
TotalBsmtSF 1 0.000685
KitchenQual 1 0.000685
BsmtUnfSF 1 0.000685
BsmtFinSF2 1 0.000685
GarageCars 1 0.000685
BsmtFinSF1 1 0.000685
SaleType 1 0.000685
OverallQual 0 0.000000
MasVnrType 0 0.000000
MasVnrArea 0 0.000000
ExterQual 0 0.000000
RoofStyle 0 0.000000
OverallCond 0 0.000000
YearLstCnst 0 0.000000

MSZoning

In [96]:
test_df['MSZoning'].describe()
Out[96]:
count     1455
unique       5
top         RL
freq      1114
Name: MSZoning, dtype: object

we can see that 77% of the values in that feature is RL, then i will fill the NA's with RL.

In [97]:
test_df['MSZoning'] = test_df['MSZoning'].fillna('RL')

'BsmtHalfBath', 'BsmtFullBath', 'TotalBsmtSF', 'BsmtUnfSF', 'BsmtFinSF2' and 'BsmtFinSF1'

none in theese features means there is no basement

In [98]:
test_df[['BsmtHalfBath','BsmtFullBath','TotalBsmtSF','BsmtUnfSF','BsmtFinSF2','BsmtFinSF1']].describe()
Out[98]:
BsmtHalfBath BsmtFullBath TotalBsmtSF BsmtUnfSF BsmtFinSF2 BsmtFinSF1
count 1457.000000 1457.000000 1458.000000 1458.000000 1458.000000 1458.000000
mean 0.065202 0.434454 1046.117970 554.294925 52.619342 439.203704
std 0.252468 0.530648 442.898624 437.260486 176.753926 455.268042
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 784.000000 219.250000 0.000000 0.000000
50% 0.000000 0.000000 988.000000 460.000000 0.000000 350.500000
75% 0.000000 1.000000 1305.000000 797.750000 0.000000 753.500000
max 2.000000 3.000000 5095.000000 2140.000000 1526.000000 4010.000000
In [99]:
for col in ('BsmtHalfBath','BsmtFullBath','TotalBsmtSF','BsmtUnfSF','BsmtFinSF2','BsmtFinSF1'):
    test_df[col] = test_df[col].fillna(0)

Functional

In [100]:
test_df['Functional'].describe()
Out[100]:
count     1457
unique       7
top        Typ
freq      1357
Name: Functional, dtype: object

93% of the values is Typ

In [101]:
test_df['Functional'] = test_df['Functional'].fillna('Typ')

SaleType

In [102]:
test_df['SaleType'].describe()
Out[102]:
count     1458
unique       9
top         WD
freq      1258
Name: SaleType, dtype: object
In [103]:
test_df["SaleType"] = test_df["SaleType"].fillna("WD")

GarageCars

In [104]:
test_df['GarageCars'].describe()
Out[104]:
count    1458.000000
mean        1.766118
std         0.775945
min         0.000000
25%         1.000000
50%         2.000000
75%         2.000000
max         5.000000
Name: GarageCars, dtype: float64
In [105]:
test_df["GarageCars"] = test_df["GarageCars"].fillna(test_df["GarageCars"].mean())

Exterior1st and Exterior2nd

In [106]:
test_df[['Exterior1st', 'Exterior2nd']].describe()
Out[106]:
Exterior1st Exterior2nd
count 1458 1458
unique 13 15
top VinylSd VinylSd
freq 510 510
In [107]:
test_df["Exterior1st"] = test_df["Exterior1st"].fillna("VinylSd")
test_df["Exterior2nd"] = test_df["Exterior2nd"].fillna("VinylSd")

KitchenQual

In [108]:
test_df['KitchenQual'].describe()
Out[108]:
count     1458
unique       4
top         TA
freq       757
Name: KitchenQual, dtype: object
In [109]:
test_df["KitchenQual"] = test_df["KitchenQual"].fillna("TA")

checking if all data is ready

In [110]:
test_df.isna().any()
Out[110]:
MSSubClass       False
MSZoning         False
LotFrontage      False
LotArea          False
Street           False
LotShape         False
LandContour      False
LotConfig        False
LandSlope        False
Neighborhood     False
Condition1       False
BldgType         False
HouseStyle       False
OverallQual      False
OverallCond      False
RoofStyle        False
Exterior1st      False
Exterior2nd      False
MasVnrType       False
MasVnrArea       False
ExterQual        False
ExterCond        False
Foundation       False
BsmtQual         False
BsmtCond         False
BsmtExposure     False
BsmtFinType1     False
BsmtFinSF1       False
BsmtFinType2     False
BsmtFinSF2       False
BsmtUnfSF        False
TotalBsmtSF      False
Heating          False
HeatingQC        False
CentralAir       False
Electrical       False
2ndFlrSF         False
LowQualFinSF     False
GrLivArea        False
BsmtFullBath     False
BsmtHalfBath     False
FullBath         False
HalfBath         False
BedroomAbvGr     False
KitchenQual      False
TotRmsAbvGrd     False
Functional       False
Fireplaces       False
GarageType       False
GarageCars       False
GarageQual       False
PavedDrive       False
WoodDeckSF       False
OpenPorchSF      False
SaleType         False
SaleCondition    False
YearLstCnst      False
dtype: bool

Training

In [111]:
from tqdm.auto import tqdm

def find_generator_len(generator, use_pbar=True):
    i = 0
    
    if use_pbar:
        pbar = tqdm(desc='Calculating Length', ncols=1000, bar_format='{desc}{bar:10}{r_bar}')

    for a in generator:
        i += 1

        if use_pbar:
            pbar.update()

    if use_pbar:
        pbar.close()

    return i

First, lets split that data to features and target.

In [112]:
t = train_df['SalePrice'].copy()
X = train_df.drop(['SalePrice'], axis=1).copy()
print('t')
display(t)
print()
print('X')
display(X)
t
0       208500
1       181500
2       223500
3       140000
4       250000
         ...  
1453    175000
1454    210000
1455    266500
1456    142125
1457    147500
Name: SalePrice, Length: 1458, dtype: int64
X
MSSubClass MSZoning LotFrontage LotArea Street LotShape LandContour LotConfig LandSlope Neighborhood Condition1 BldgType HouseStyle OverallQual OverallCond RoofStyle Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces GarageType GarageCars GarageQual PavedDrive WoodDeckSF OpenPorchSF SaleType SaleCondition YearLstCnst
0 60 RL 65.0 8450 Pave Reg Lvl Inside Gtl CollgCr Norm 1Fam 2Story 7 5 Gable VinylSd VinylSd BrkFace 196.0 Gd TA PConc Gd TA No GLQ 706 Unf 0 150 856 GasA Ex Y SBrkr 854 0 1710 1 0 2 1 3 Gd 8 Typ 0 Attchd 2 TA Y 0 61 WD Normal 2003
1 20 RL 80.0 9600 Pave Reg Lvl FR2 Gtl Veenker Feedr 1Fam 1Story 6 8 Gable MetalSd MetalSd None 0.0 TA TA CBlock Gd TA Gd ALQ 978 Unf 0 284 1262 GasA Ex Y SBrkr 0 0 1262 0 1 2 0 3 TA 6 Typ 1 Attchd 2 TA Y 298 0 WD Normal 1976
2 60 RL 68.0 11250 Pave IR1 Lvl Inside Gtl CollgCr Norm 1Fam 2Story 7 5 Gable VinylSd VinylSd BrkFace 162.0 Gd TA PConc Gd TA Mn GLQ 486 Unf 0 434 920 GasA Ex Y SBrkr 866 0 1786 1 0 2 1 3 Gd 6 Typ 1 Attchd 2 TA Y 0 42 WD Normal 2002
3 70 RL 60.0 9550 Pave IR1 Lvl Corner Gtl Crawfor Norm 1Fam 2Story 7 5 Gable Wd Sdng Wd Shng None 0.0 TA TA BrkTil TA Gd No ALQ 216 Unf 0 540 756 GasA Gd Y SBrkr 756 0 1717 1 0 1 0 3 Gd 7 Typ 1 Detchd 3 TA Y 0 35 WD Abnorml 1970
4 60 RL 84.0 14260 Pave IR1 Lvl FR2 Gtl NoRidge Norm 1Fam 2Story 8 5 Gable VinylSd VinylSd BrkFace 350.0 Gd TA PConc Gd TA Av GLQ 655 Unf 0 490 1145 GasA Ex Y SBrkr 1053 0 2198 1 0 2 1 4 Gd 9 Typ 1 Attchd 3 TA Y 192 84 WD Normal 2000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1453 60 RL 62.0 7917 Pave Reg Lvl Inside Gtl Gilbert Norm 1Fam 2Story 6 5 Gable VinylSd VinylSd None 0.0 TA TA PConc Gd TA No Unf 0 Unf 0 953 953 GasA Ex Y SBrkr 694 0 1647 0 0 2 1 3 TA 7 Typ 1 Attchd 2 TA Y 0 40 WD Normal 2000
1454 20 RL 85.0 13175 Pave Reg Lvl Inside Gtl NWAmes Norm 1Fam 1Story 6 6 Gable Plywood Plywood Stone 119.0 TA TA CBlock Gd TA No ALQ 790 Rec 163 589 1542 GasA TA Y SBrkr 0 0 2073 1 0 2 0 3 TA 7 Min1 2 Attchd 2 TA Y 349 0 WD Normal 1988
1455 70 RL 66.0 9042 Pave Reg Lvl Inside Gtl Crawfor Norm 1Fam 2Story 7 9 Gable CemntBd CmentBd None 0.0 Ex Gd Stone TA Gd No GLQ 275 Unf 0 877 1152 GasA Ex Y SBrkr 1152 0 2340 0 0 2 0 4 Gd 9 Typ 2 Attchd 1 TA Y 0 60 WD Normal 2006
1456 20 RL 68.0 9717 Pave Reg Lvl Inside Gtl NAmes Norm 1Fam 1Story 5 6 Hip MetalSd MetalSd None 0.0 TA TA CBlock TA TA Mn GLQ 49 Rec 1029 0 1078 GasA Gd Y FuseA 0 0 1078 1 0 1 0 2 Gd 5 Typ 0 Attchd 1 TA Y 366 0 WD Normal 1996
1457 20 RL 75.0 9937 Pave Reg Lvl Inside Gtl Edwards Norm 1Fam 1Story 5 6 Gable HdBoard HdBoard None 0.0 Gd TA CBlock TA TA No BLQ 830 LwQ 290 136 1256 GasA Gd Y SBrkr 0 0 1256 1 0 1 1 3 TA 6 Typ 0 Attchd 1 TA Y 736 68 WD Normal 1965

1458 rows × 57 columns

In [113]:
# calculate score and loss from cv (KFold) and display graphs
from sklearn.model_selection import KFold
def get_cv_score_and_loss(X, t, model, k, show_score_loss_graphs=False, use_pbar=True):
    scores_losses_df = pd.DataFrame(columns=['fold_id', 'split', 'score', 'loss'])

    cv = KFold(n_splits=k, shuffle=True, random_state=1)

    if use_pbar:
        pbar = tqdm(desc='Computing Models', total=find_generator_len(cv.split(X)))

    for i,(train_ids, val_ids) in enumerate(cv.split(X)):

        X_train = X.loc[train_ids]
        t_train = t.loc[train_ids]
        X_val = X.loc[val_ids]
        t_val = t.loc[val_ids]

        model.fit(X_train, t_train)

        y_train = model.predict(X_train)
        y_val = model.predict(X_val)
        scores_losses_df.loc[len(scores_losses_df)] = [i, 'train', model.score(X_train, t_train), mean_squared_error(t_train, y_train, squared=False)]
        scores_losses_df.loc[len(scores_losses_df)] = [i, 'val', model.score(X_val, t_val), mean_squared_error(t_val, y_val, squared=False)]

        if use_pbar:
            pbar.update()

    if use_pbar:
        pbar.close()


    val_scores_losses_df = scores_losses_df[scores_losses_df['split']=='val']
    train_scores_losses_df = scores_losses_df[scores_losses_df['split']=='train']

    mean_val_score = val_scores_losses_df['score'].mean()
    mean_val_loss = val_scores_losses_df['loss'].mean()
    mean_train_score = train_scores_losses_df['score'].mean()
    mean_train_loss = train_scores_losses_df['loss'].mean()

    if show_score_loss_graphs:
        fig = px.line(scores_losses_df, x='fold_id', y='score', color='split', title=f'Mean Val Score: {mean_val_score:.2f}, Mean Train Score: {mean_train_score:.2f}')
        fig.show()
        fig = px.line(scores_losses_df, x='fold_id', y='loss', color='split', title=f'Mean Val Loss: {mean_val_loss:.2f}, Mean Train Loss: {mean_train_loss:.2f}')
        fig.show()

    return mean_val_score, mean_val_loss, mean_train_score, mean_train_loss

10-Fold CV.

We will encode the categorical features with OHE and standard scalar on the numerical features.

In [114]:
from sklearn.compose import ColumnTransformer
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X.select_dtypes(include=['object', 'bool']).columns
ct = ColumnTransformer([
    ("encoding", OneHotEncoder(sparse=False, handle_unknown='ignore'), categorical_cols),
    ("standard", StandardScaler(), numerical_cols)])
model_pipe = make_pipeline(ct, SGDRegressor(random_state=1))
# model = SGDRegressor(random_state=1)
val_score, val_loss, train_score, train_loss = get_cv_score_and_loss(X, t, model_pipe, k=10, show_score_loss_graphs=True)
print(f'mean cv val score: {val_score:.2f}\nmean cv val loss {val_loss:.2f}')
print(f'mean cv train score: {train_score:.2f}\nmean cv train loss {train_loss:.2f}')


mean cv val score: 0.91
mean cv val loss 24125.05
mean cv train score: 0.93
mean cv train loss 21497.87

Feature selection

We will use Scikit-learn RFECV that is based on the Backward Feature Selection.
The default CV is 5-fold cross-validation.
We will enter the Scikit-learn RepeatedKFold to repeat each KFold a few times with different splits.

In [115]:
# find best subset of features on this dataset

from sklearn.feature_selection import RFECV
from sklearn.model_selection import RepeatedKFold
df = X.copy()
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X.select_dtypes(include=['object', 'bool']).columns
all_cols = categorical_cols.tolist() + numerical_cols.tolist()
ct = ColumnTransformer([
    ("encoding", OrdinalEncoder(), categorical_cols),
    ("standard", StandardScaler(), numerical_cols)])
X_encoded = pd.DataFrame(ct.fit_transform(X, t),columns=all_cols)
# model_pipe = make_pipeline(ct, SGDRegressor(random_state=1))
selector = RFECV(SGDRegressor(random_state=1), cv=RepeatedKFold(n_splits=5, n_repeats=5, random_state=1)).fit(X_encoded, t)
display(X_encoded.loc[:, selector.support_])
best_features = selector.support_


fig = go.Figure()
fig.add_trace(go.Scatter(x=[i for i in range(1, len(selector.grid_scores_) + 1)], y=selector.grid_scores_))
fig.update_xaxes(title_text="Number of features selected")
fig.update_yaxes(title_text="Cross validation score (nb of correct classifications)")
fig.show()
print(X.loc[:, best_features].keys())
print("Number of features: {}".format(len(X.loc[:, best_features].keys())))
MSZoning Street LandContour LandSlope RoofStyle MasVnrType ExterQual ExterCond Foundation BsmtQual BsmtFinType1 Heating CentralAir KitchenQual Functional GarageQual SaleCondition LotArea OverallQual OverallCond MasVnrArea TotalBsmtSF GrLivArea BedroomAbvGr GarageCars
0 3.0 1.0 3.0 0.0 1.0 1.0 2.0 4.0 2.0 2.0 2.0 1.0 1.0 2.0 6.0 5.0 4.0 -0.203934 0.658506 -0.517649 0.523937 -0.473766 0.393013 0.163894 0.313159
1 3.0 1.0 3.0 0.0 1.0 2.0 3.0 4.0 1.0 2.0 0.0 1.0 1.0 3.0 6.0 5.0 4.0 -0.087252 -0.068293 2.177825 -0.570739 0.504925 -0.489391 0.163894 0.313159
2 3.0 1.0 3.0 0.0 1.0 1.0 2.0 4.0 2.0 2.0 2.0 1.0 1.0 2.0 6.0 5.0 4.0 0.080162 0.658506 -0.517649 0.334044 -0.319490 0.542706 0.163894 0.313159
3 3.0 1.0 3.0 0.0 1.0 2.0 3.0 4.0 0.0 4.0 0.0 1.0 1.0 2.0 6.0 5.0 0.0 -0.092325 0.658506 -0.517649 -0.570739 -0.714823 0.406800 0.163894 1.652119
4 3.0 1.0 3.0 0.0 1.0 1.0 2.0 4.0 2.0 2.0 2.0 1.0 1.0 2.0 6.0 5.0 4.0 0.385566 1.385305 -0.517649 1.384039 0.222888 1.354202 1.389320 1.652119
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1453 3.0 1.0 3.0 0.0 1.0 2.0 3.0 4.0 2.0 2.0 6.0 1.0 1.0 3.0 6.0 5.0 4.0 -0.258014 -0.068293 -0.517649 -0.570739 -0.239941 0.268925 0.163894 0.313159
1454 3.0 1.0 3.0 0.0 1.0 3.0 3.0 4.0 1.0 2.0 0.0 1.0 1.0 3.0 2.0 5.0 4.0 0.275478 -0.068293 0.380842 0.093885 1.179884 1.107996 0.163894 0.313159
1455 3.0 1.0 3.0 0.0 1.0 2.0 0.0 2.0 4.0 4.0 2.0 1.0 1.0 2.0 6.0 5.0 4.0 -0.143868 0.658506 3.076316 -0.570739 0.239762 1.633893 1.389320 -1.025802
1456 3.0 1.0 3.0 0.0 3.0 2.0 3.0 4.0 1.0 4.0 2.0 1.0 1.0 2.0 6.0 5.0 4.0 -0.075381 -0.795092 0.380842 -0.570739 0.061380 -0.851806 -1.061532 -1.025802
1457 3.0 1.0 3.0 0.0 1.0 2.0 2.0 4.0 1.0 4.0 1.0 1.0 1.0 3.0 6.0 5.0 4.0 -0.053059 -0.795092 0.380842 -0.570739 0.490461 -0.501208 0.163894 -1.025802

1458 rows × 25 columns

Index(['MSZoning', 'LotFrontage', 'Street', 'LandContour', 'BldgType',
       'OverallCond', 'RoofStyle', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'ExterCond', 'BsmtQual', 'BsmtExposure', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtUnfSF', 'HeatingQC', 'Electrical', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'HalfBath', 'TotRmsAbvGrd', 'GarageQual', 'OpenPorchSF'],
      dtype='object')
Number of features: 25
In [116]:
X_best = X.loc[:,best_features]
X_best
Out[116]:
MSZoning LotFrontage Street LandContour BldgType OverallCond RoofStyle Exterior1st Exterior2nd MasVnrType ExterCond BsmtQual BsmtExposure BsmtFinSF1 BsmtFinType2 BsmtUnfSF HeatingQC Electrical 2ndFlrSF LowQualFinSF GrLivArea HalfBath TotRmsAbvGrd GarageQual OpenPorchSF
0 RL 65.0 Pave Lvl 1Fam 5 Gable VinylSd VinylSd BrkFace TA Gd No 706 Unf 150 Ex SBrkr 854 0 1710 1 8 TA 61
1 RL 80.0 Pave Lvl 1Fam 8 Gable MetalSd MetalSd None TA Gd Gd 978 Unf 284 Ex SBrkr 0 0 1262 0 6 TA 0
2 RL 68.0 Pave Lvl 1Fam 5 Gable VinylSd VinylSd BrkFace TA Gd Mn 486 Unf 434 Ex SBrkr 866 0 1786 1 6 TA 42
3 RL 60.0 Pave Lvl 1Fam 5 Gable Wd Sdng Wd Shng None TA TA No 216 Unf 540 Gd SBrkr 756 0 1717 0 7 TA 35
4 RL 84.0 Pave Lvl 1Fam 5 Gable VinylSd VinylSd BrkFace TA Gd Av 655 Unf 490 Ex SBrkr 1053 0 2198 1 9 TA 84
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1453 RL 62.0 Pave Lvl 1Fam 5 Gable VinylSd VinylSd None TA Gd No 0 Unf 953 Ex SBrkr 694 0 1647 1 7 TA 40
1454 RL 85.0 Pave Lvl 1Fam 6 Gable Plywood Plywood Stone TA Gd No 790 Rec 589 TA SBrkr 0 0 2073 0 7 TA 0
1455 RL 66.0 Pave Lvl 1Fam 9 Gable CemntBd CmentBd None Gd TA No 275 Unf 877 Ex SBrkr 1152 0 2340 0 9 TA 60
1456 RL 68.0 Pave Lvl 1Fam 6 Hip MetalSd MetalSd None TA TA Mn 49 Rec 0 Gd FuseA 0 0 1078 0 5 TA 0
1457 RL 75.0 Pave Lvl 1Fam 6 Gable HdBoard HdBoard None TA TA No 830 LwQ 136 Gd SBrkr 0 0 1256 1 6 TA 68

1458 rows × 25 columns

Analysis

As we see feature selection selected 25 features as best score.

next step:

Now, let's check diffrent polynomial degrees on this dataset and then choose the one with the best result.

In [117]:
# show graph of score and loss by plynomial degree of numerical features
def show_degree_graphs_cv_train(X, t, model, k, max_degree=10):
    numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns
    categorical_cols = X.select_dtypes(include=['object', 'bool']).columns
    
    val_train_score_loss_df = pd.DataFrame(columns=['degree', 'split', 'score', 'loss'])
    for i in tqdm(range(1, max_degree), desc='Poly Degree'):
        ct_enc_std_poly = ColumnTransformer([
            ("encoding", OneHotEncoder(sparse=False, handle_unknown='ignore'), categorical_cols),
            ("standard_poly", make_pipeline(PolynomialFeatures(degree=i), StandardScaler()), numerical_cols)])
        model_pipe = make_pipeline(ct_enc_std_poly, model)
        # model_pipe = make_pipeline(PolynomialFeatures(degree=i),model)
        val_score, val_loss, train_score, train_loss = get_cv_score_and_loss(X, t, model_pipe, k=k, show_score_loss_graphs=False, use_pbar=False)
        val_train_score_loss_df.loc[len(val_train_score_loss_df)] = [i, 'train', train_score, train_loss]
        val_train_score_loss_df.loc[len(val_train_score_loss_df)] = [i, 'cv', val_score, val_loss]

    fig = px.line(val_train_score_loss_df, x='degree', y='score', color='split')
    fig.show()
    fig = px.line(val_train_score_loss_df, x='degree', y='loss', color='split')
    fig.show()  
    
    max_val = val_train_score_loss_df["score"].max()
    best_degree = val_train_score_loss_df[val_train_score_loss_df["score"] == max_val]["degree"].to_numpy()[0]
    return best_degree

best_degree = show_degree_graphs_cv_train(X_best, t, SGDRegressor(random_state=1), k=10 ,max_degree=5)

Analysis:

In both graphs above score and loss the best result is 2 . The features that we selected chosen by the feature selection function.

Before, we tried to use all the 57 features that left after the data research, and it has lower score on the CV and on the final test.

next step :

Let`s change some hyper-parameters of the SGD.

In [118]:
def choose_best_lr(x,t):
  scores = pd.DataFrame(columns=["lr", "val_score", "val_loss", "train_score", "train_loss"])
  lr=0.0001
  for i in range(1000):
    selector = SGDRegressor(random_state=1, eta0=lr, learning_rate="constant").fit(x, t)
    mean_val_score, mean_val_loss, mean_train_score, mean_train_loss = get_cv_score_and_loss(x, t, selector, k=10, show_score_loss_graphs=False, use_pbar=False)
    if mean_val_score < 0:
      break
    scores.loc[len(scores)] = [lr, mean_val_score, mean_val_loss, mean_train_score, mean_train_loss]
    lr += 0.0001

  fig = go.Figure()
  fig.add_trace(go.Scatter(x=scores["lr"], y=scores["val_score"]))
  fig.update_xaxes(title_text="Learning Rate")
  fig.update_yaxes(title_text="Cross validation score (no. of correct classifications)")
  fig.show()

  max_val = scores["val_score"].max()
  best_lr = scores[scores["val_score"] == max_val]["lr"].to_numpy()[0]
  return best_lr, max_val

X_ = X.loc[:,best_features]
numerical_cols = X_.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X_.select_dtypes(include=['object', 'bool']).columns
all_cols = categorical_cols.tolist() + numerical_cols.tolist()
ct = ColumnTransformer([
    ("encoding", OneHotEncoder(sparse=False, handle_unknown='ignore'), categorical_cols),
    ("standard", StandardScaler(), numerical_cols)])
X_encoded = pd.DataFrame(ct.fit_transform(X_, t))

best_lr,best_score = choose_best_lr(X_encoded,t)
print("Best learning rate : {}".format(best_lr))
Best learning rate : 0.0011000000000000003

Regularization methods :

We will use diffrent regularization methods as we stuidied in class:

  • l1 = lasso
  • l2 = l1ridge
  • elasticnet = elasticNet
In [119]:
def choose_regularization(x,t):
  scores = pd.DataFrame(columns=["penalty", "val_score", "val_loss", "train_score", "train_loss"])
  for penalty in ['l1','l2','elasticnet']:
    selector = SGDRegressor(penalty=penalty ,random_state=1, eta0=best_lr, learning_rate="constant").fit(x, t)
    mean_val_score, mean_val_loss, mean_train_score, mean_train_loss = get_cv_score_and_loss(x, t, selector, k=10, show_score_loss_graphs=False, use_pbar=False)
    scores.loc[len(scores)] = [penalty, mean_val_score, mean_val_loss, mean_train_score, mean_train_loss]
    

  fig = go.Figure()
  fig.add_trace(go.Scatter(x=scores["penalty"], y=scores["val_score"]))
  fig.update_xaxes(title_text="penalty")
  fig.update_yaxes(title_text="Cross validation score (no. of correct classifications)")
  fig.show()

  max_val = scores["val_score"].max()
  best_penalty = scores[scores["val_score"] == max_val]["penalty"]
  return best_penalty, max_val
  
X_ = X.loc[:,best_features]
numerical_cols = X_.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X_.select_dtypes(include=['object', 'bool']).columns
all_cols = categorical_cols.tolist() + numerical_cols.tolist()
ct = ColumnTransformer([
    ("encoding", OneHotEncoder(sparse=False, handle_unknown='ignore'), categorical_cols),
    ("standard", StandardScaler(), numerical_cols)])
X_encoded = pd.DataFrame(ct.fit_transform(X_, t))

best_penalty,best_score = choose_regularization(X_encoded,t)
print("Best penalty : {}".format(best_penalty))
Best penalty : 1    l2
Name: penalty, dtype: object

We can see that l2 = ridge give us the best cv score.

In [120]:
best_penalty = 'l2'

Alpha parameter

Now we will do the same for the regularization parameter of the SGD.

In [57]:
def choose_best_alpha(x,t):
  scores = pd.DataFrame(columns=["alpha", "val_score", "val_loss", "train_score", "train_loss"])
  alpha=0.0001
  
  for i in tqdm(range(100), desc='Alpha'):
    selector = SGDRegressor(penalty = best_penalty, alpha=alpha, random_state=1, eta0=best_lr, learning_rate="constant").fit(x, t)
    mean_val_score, mean_val_loss, mean_train_score, mean_train_loss = get_cv_score_and_loss(x, t, selector, k=10, show_score_loss_graphs=False, use_pbar=False)
    if mean_val_score < 0:
      break
    scores.loc[len(scores)] = [alpha, mean_val_score, mean_val_loss, mean_train_score, mean_train_loss]
    alpha += 0.0001

  fig = go.Figure()
  fig.add_trace(go.Scatter(x=scores["alpha"], y=scores["val_score"]))
  fig.update_xaxes(title_text="alpha")
  fig.update_yaxes(title_text="Cross validation score (no. of correct classifications)")
  fig.show()

  max_val = scores["val_score"].max()
  best_alpha = scores[scores["val_score"] == max_val]["alpha"].to_numpy()[0]
  return best_alpha, max_val

X_ = X.loc[:,best_features]
numerical_cols = X_.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = X_.select_dtypes(include=['object', 'bool']).columns
all_cols = categorical_cols.tolist() + numerical_cols.tolist()
ct = ColumnTransformer([
    ("encoding", OneHotEncoder(sparse=False, handle_unknown='ignore'), categorical_cols),
    ("standard", StandardScaler(), numerical_cols)])
X_encoded = pd.DataFrame(ct.fit_transform(X_, t))

best_alpha,best_score = choose_best_alpha(X_encoded,t)
print("Best learning rate : {}".format(best_alpha))

Best learning rate : 0.0010000000000000002

Final model and submission

Now we will use the result of our research to build a model.

In [58]:
print("Number of features = {}, Degree = {}, Penalty = {}, Learning rate = {}, alpha = {}".format(len(X.loc[:, best_features].keys()),best_degree,best_penalty,best_lr, best_alpha))
Number of features = 25, Degree = 2, Penalty = l2, Learning rate = 0.0011000000000000003, alpha = 0.0010000000000000002
In [59]:
features = X.loc[:,best_features].keys()
X_train = X[features]
display(X_train)
MSZoning LotFrontage Street LandContour BldgType OverallCond RoofStyle Exterior1st Exterior2nd MasVnrType ExterCond BsmtQual BsmtExposure BsmtFinSF1 BsmtFinType2 BsmtUnfSF HeatingQC Electrical 2ndFlrSF LowQualFinSF GrLivArea HalfBath TotRmsAbvGrd GarageQual OpenPorchSF
0 RL 65.0 Pave Lvl 1Fam 5 Gable VinylSd VinylSd BrkFace TA Gd No 706 Unf 150 Ex SBrkr 854 0 1710 1 8 TA 61
1 RL 80.0 Pave Lvl 1Fam 8 Gable MetalSd MetalSd None TA Gd Gd 978 Unf 284 Ex SBrkr 0 0 1262 0 6 TA 0
2 RL 68.0 Pave Lvl 1Fam 5 Gable VinylSd VinylSd BrkFace TA Gd Mn 486 Unf 434 Ex SBrkr 866 0 1786 1 6 TA 42
3 RL 60.0 Pave Lvl 1Fam 5 Gable Wd Sdng Wd Shng None TA TA No 216 Unf 540 Gd SBrkr 756 0 1717 0 7 TA 35
4 RL 84.0 Pave Lvl 1Fam 5 Gable VinylSd VinylSd BrkFace TA Gd Av 655 Unf 490 Ex SBrkr 1053 0 2198 1 9 TA 84
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1453 RL 62.0 Pave Lvl 1Fam 5 Gable VinylSd VinylSd None TA Gd No 0 Unf 953 Ex SBrkr 694 0 1647 1 7 TA 40
1454 RL 85.0 Pave Lvl 1Fam 6 Gable Plywood Plywood Stone TA Gd No 790 Rec 589 TA SBrkr 0 0 2073 0 7 TA 0
1455 RL 66.0 Pave Lvl 1Fam 9 Gable CemntBd CmentBd None Gd TA No 275 Unf 877 Ex SBrkr 1152 0 2340 0 9 TA 60
1456 RL 68.0 Pave Lvl 1Fam 6 Hip MetalSd MetalSd None TA TA Mn 49 Rec 0 Gd FuseA 0 0 1078 0 5 TA 0
1457 RL 75.0 Pave Lvl 1Fam 6 Gable HdBoard HdBoard None TA TA No 830 LwQ 136 Gd SBrkr 0 0 1256 1 6 TA 68

1458 rows × 25 columns

In [143]:
ct = ColumnTransformer([
    ("encoding", OneHotEncoder(sparse=False, handle_unknown='ignore'), categorical_cols),
    ("standard", make_pipeline(PolynomialFeatures(degree=2), StandardScaler()), numerical_cols)])
model_pipe = make_pipeline(ct, SGDRegressor(penalty='elasticnet', random_state=1, alpha=best_alpha, eta0=best_lr, learning_rate="constant"))
model_pipe.fit(X_train,t)
Out[143]:
Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('encoding',
                                                  OneHotEncoder(categories='auto',
                                                                drop=None,
                                                                dtype=<class 'numpy.float64'>,
                                                                handle_unknown='ignore',
                                                                sparse=False),
                                                  Index(['MSZoning', 'Street', 'LandContour', 'BldgType', 'RoofStyle',
       'Exterior...
                 SGDRegressor(alpha=0.0010000000000000002, average=False,
                              early_stopping=False, epsilon=0.1,
                              eta0=0.0011000000000000003, fit_intercept=True,
                              l1_ratio=0.15, learning_rate='constant',
                              loss='squared_loss', max_iter=1000,
                              n_iter_no_change=5, penalty='elasticnet',
                              power_t=0.25, random_state=1, shuffle=True,
                              tol=0.001, validation_fraction=0.1, verbose=0,
                              warm_start=False))],
         verbose=False)
In [144]:
y_train = model_pipe.predict(X_train)

rmse = mean_squared_error(t, y_train, squared=False)
print("Final RMSE: {}".format(rmse))
Final RMSE: 25936.086806308813
In [145]:
X_test = test_df[features]
X_test
Out[145]:
MSZoning LotFrontage Street LandContour BldgType OverallCond RoofStyle Exterior1st Exterior2nd MasVnrType ExterCond BsmtQual BsmtExposure BsmtFinSF1 BsmtFinType2 BsmtUnfSF HeatingQC Electrical 2ndFlrSF LowQualFinSF GrLivArea HalfBath TotRmsAbvGrd GarageQual OpenPorchSF
0 RH 80.0 Pave Lvl 1Fam 6 Gable VinylSd VinylSd None TA TA No 468.0 LwQ 270.0 TA SBrkr 0 0 896 0 5 TA 0
1 RL 81.0 Pave Lvl 1Fam 6 Hip Wd Sdng Wd Sdng BrkFace TA TA No 923.0 Unf 406.0 TA SBrkr 0 0 1329 1 6 TA 36
2 RL 74.0 Pave Lvl 1Fam 5 Gable VinylSd VinylSd None TA Gd No 791.0 Unf 137.0 Gd SBrkr 701 0 1629 1 6 TA 34
3 RL 78.0 Pave Lvl 1Fam 6 Gable VinylSd VinylSd BrkFace TA TA No 602.0 Unf 324.0 Ex SBrkr 678 0 1604 1 7 TA 36
4 RL 43.0 Pave HLS TwnhsE 5 Gable HdBoard HdBoard None TA Gd No 263.0 Unf 1017.0 Ex SBrkr 0 0 1280 0 5 TA 82
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1454 RM 21.0 Pave Lvl Twnhs 7 Gable CemntBd CmentBd None TA TA No 0.0 Unf 546.0 Gd SBrkr 546 0 1092 1 5 None 0
1455 RM 21.0 Pave Lvl TwnhsE 5 Gable CemntBd CmentBd None TA TA No 252.0 Unf 294.0 TA SBrkr 546 0 1092 1 6 TA 24
1456 RL 160.0 Pave Lvl 1Fam 7 Gable VinylSd VinylSd None TA TA No 1224.0 Unf 0.0 Ex SBrkr 0 0 1224 0 7 TA 0
1457 RL 62.0 Pave Lvl 1Fam 5 Gable HdBoard Wd Shng None TA Gd Av 337.0 Unf 575.0 TA SBrkr 0 0 970 0 6 None 32
1458 RL 74.0 Pave Lvl 1Fam 5 Gable HdBoard HdBoard BrkFace TA Gd Av 758.0 Unf 238.0 Ex SBrkr 1004 0 2000 1 9 TA 48

1459 rows × 25 columns

In [146]:
X_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 25 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   MSZoning      1459 non-null   object 
 1   LotFrontage   1459 non-null   float64
 2   Street        1459 non-null   object 
 3   LandContour   1459 non-null   object 
 4   BldgType      1459 non-null   object 
 5   OverallCond   1459 non-null   int64  
 6   RoofStyle     1459 non-null   object 
 7   Exterior1st   1459 non-null   object 
 8   Exterior2nd   1459 non-null   object 
 9   MasVnrType    1459 non-null   object 
 10  ExterCond     1459 non-null   object 
 11  BsmtQual      1459 non-null   object 
 12  BsmtExposure  1459 non-null   object 
 13  BsmtFinSF1    1459 non-null   float64
 14  BsmtFinType2  1459 non-null   object 
 15  BsmtUnfSF     1459 non-null   float64
 16  HeatingQC     1459 non-null   object 
 17  Electrical    1459 non-null   object 
 18  2ndFlrSF      1459 non-null   int64  
 19  LowQualFinSF  1459 non-null   int64  
 20  GrLivArea     1459 non-null   int64  
 21  HalfBath      1459 non-null   int64  
 22  TotRmsAbvGrd  1459 non-null   int64  
 23  GarageQual    1459 non-null   object 
 24  OpenPorchSF   1459 non-null   int64  
dtypes: float64(3), int64(7), object(15)
memory usage: 285.1+ KB

Prediction

In [148]:
y_test =  model_pipe.predict(X_test)

submission = pd.DataFrame({
        "Id": test_id,
        "SalePrice": y_test
    })
submission.to_csv('submission5.csv', index=False)

Submissions

Conclusion

In this assginment were asked to predict the house prices in Ames, Lowa.
Our data is assembled from 80 features some of them were related and some didn`t , data exloring and research was important and because of the size of the features there was a lot of work ,finding out which of the features have the biggest potential to predict the targets value.

To achive that goal we used correlation methods and graphs that showed us some information about the relationship between the features and the target .

In addition, we used CV and K-Fold to find out the best degree, regularization method and some hyper-parameters such as alpha and learning rate,these all helped to learn the model in the best way.

insights:

  • choosing the best degree for the model has a major impact of the CV predict.
  • OHE on categorical features turn out to be mistake because its returned a lot of features .
  • Training is very important to our result
In [ ]: